Sentence boundary detection of spontaneous Japanese using statistical language model and support vector machines
Authors
Abstract
This paper presents two approaches to sentence boundary detection in spontaneous Japanese, one based on a statistical language model (SLM) and one based on support vector machines (SVM). In the SLM-based approach, linguistic likelihoods and the occurrence of pauses are used to determine sentence boundaries. To suppress false alarms, heuristic patterns of end-of-sentence expressions are also incorporated. The SVM-based approach, in contrast, is adopted to achieve classification that is robust against a wide variety of expressions and against speech recognition errors. Detection is performed by an SVM-based text chunker that uses lexical and pause information as features. We evaluated these approaches on manual and automatic transcriptions of spontaneous lectures and speeches, and achieved F-measures of 0.85 and 0.78, respectively.
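The abstract reports results as F-measures. As a minimal sketch of how such a score is commonly computed for boundary detection, the Python snippet below compares a set of reference boundary positions with a set of hypothesized ones. The function name `boundary_f_measure`, the use of word indices as boundary positions, and the exact-match criterion are all assumptions made for illustration; the paper's own scoring procedure is not described here.

```python
def boundary_f_measure(reference, hypothesis):
    """Return (precision, recall, F-measure) for detected sentence boundaries.

    Both arguments are iterables of boundary positions (assumed here to be
    word indices after which a sentence ends). A hypothesized boundary counts
    as correct only if it matches a reference position exactly.
    """
    ref = set(reference)
    hyp = set(hypothesis)
    correct = len(ref & hyp)

    # Guard against empty sets to avoid division by zero.
    precision = correct / len(hyp) if hyp else 0.0
    recall = correct / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0

    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the paper.
    reference_boundaries = [4, 9, 15, 22]
    detected_boundaries = [4, 9, 17, 22, 30]
    p, r, f = boundary_f_measure(reference_boundaries, detected_boundaries)
    print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

In this toy case three of five detected boundaries are correct and three of four reference boundaries are found, so false alarms lower precision while misses lower recall; the F-measure balances the two.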
Similar resources
Sentence boundary detection using sequential dependency analysis combined with CRF-based chunking
In spoken language, sentence boundaries are much less explicit than in written language. Since conventional natural language processing (NLP) techniques are generally designed assuming the sentence boundaries are already given, it is crucial to detect the boundaries accurately for applying such NLP techniques to spoken language. Classification frameworks, such as Support Vector Machines (SVMs) ...
Dependency structure analysis and sentence boundary detection in spontaneous Japanese
This paper addresses automatic detection of dependencies between Japanese phrasal units called bunsetsus, and sentence boundaries in a spontaneous speech corpus. In spontaneous speech, the biggest problem with dependency structure analysis is that sentence boundaries are ambiguous. In this paper, we propose two methods for improving the accuracy of sentence boundary detection in spontaneous Jap...
Mining Biological Repetitive Sequences Using Support Vector Machines and Fuzzy SVM
Structural repetitive subsequences are the most important portion of biological sequences and play crucial roles in the corresponding sequence’s fold and functionality. The biggest class of repetitive subsequences is “Transposable Elements”, which has its own sub-classes based on the contexts’ structures. Much research has been performed to critically determine the structure and function of repetitiv...
Detection of sentence boundaries and abbreviations in clinical narratives
BACKGROUND: In Western languages the period character is highly ambiguous, due to its double role as sentence delimiter and abbreviation marker. This is particularly relevant in clinical free-texts characterized by numerous anomalies in spelling, punctuation, vocabulary and with a high frequency of short forms. METHODS: The problem is addressed by two binary classifiers for abbreviation and sen...
Fault diagnosis in a distillation column using a support vector machine based classifier
Fault diagnosis has always been an essential aspect of control system design. This is necessary due to the growing demand for increased performance and safety of industrial systems. The support vector machine classifier is a new technique based on statistical learning theory and is designed to reduce structural bias. Support vector machine classification in many applications in v...